1 How did your question change, if at all, after EDA?

At the outset we asked a few questions, such as “Which race is the majority of the sample? Are patients from a certain race?” During EDA we dropped the second question, because this is a region-level study of the COVID-19 epidemic across the United States, not a study of individual patients. We cannot determine the race of each confirmed case.

We also dropped “Which race has the highest average death rate and total cases?” for the same reason: without individual-level data we cannot compute statistics by patient race. We can only observe, from the correlation plot, how total cases and deaths correlate with the proportion of each race in a region. We therefore changed the question to “The proportion of which race is related to the number of confirmed cases and deaths?”

We added a few more questions, such as “Are the total cases related to age/gender/poverty?” We first divided total cases into four levels and then found that the averages of these variables differ significantly across levels, so we concluded that they are related to total cases.
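The level-based comparison described above can be sketched in base R as follows. This is a minimal illustration, assuming the four levels are the quartiles of total cases; the data frame, the `poverty` column, and all values are made up and are not the project data.

```r
# Toy data; every value here is invented for illustration only.
df <- data.frame(
  total_cases = c(0, 1, 2, 5, 9, 20, 39, 150, 900, 5000, 25000, 110465),
  poverty     = c(22, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10)
)

# Cut total_cases at its quartiles to form four levels
# (assumes the quartile break points are distinct).
breaks <- quantile(df$total_cases, probs = seq(0, 1, 0.25))
df$level <- cut(df$total_cases, breaks = breaks,
                labels = c("L1", "L2", "L3", "L4"),
                include.lowest = TRUE)

# Mean of the covariate within each level.
level_means <- aggregate(poverty ~ level, data = df, FUN = mean)
print(level_means)
```

If the group means differ clearly from level to level, as they do in this toy example, that suggests an association between the covariate and total cases; a formal test such as a one-way ANOVA would be needed to call the difference statistically significant.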

We also set another question at the beginning: “Have there been any general trends among the health conditions?” Our EDA shows that the correlation coefficients between the health variables (such as sleep status, medical history of various diseases, smoking, and obesity) and deaths are small. Only the correlation between liver_total_death and deaths is relatively high.

We deleted the question “Are there any common underlying health conditions?” and changed it to “Does any disease relate to the death rate?”.

2 Based on EDA can you begin to sketch out an answer to your question?

2.1 United States COVID-19 Cases and Deaths by Provinces (Cities)

2.1.1 What are the top 15 Provinces based on the number of cases?

The following bar chart shows the top 15 provinces by number of COVID-19 cases.

The bar chart above shows the top 15 provinces by number of cases. New York City has the most COVID-19 cases, more than 100,000, while every other province has fewer than 30,000.

2.1.2 What are the top 15 Provinces based on the number of deaths?

The following bar chart shows the top 15 provinces by number of deaths.

The bar chart above shows the top 15 provinces by number of deaths. New York City has the most deaths, around 8,000, while every other province has roughly 1,000 or fewer.

2.1.3 What are the top 15 States based on the number of Tests?

The bar chart above shows the top 15 states by number of tests. New York State has performed the most tests, 499,143, while every other state has performed fewer than 200,000.

2.1.4 What is the average number of cases for each State?

                  State total_cases
1               Alabama       59.03
2                Alaska        9.83
3               Arizona      258.60
4              Arkansas       19.44
5            California      437.43
6              Colorado      122.41
7           Connecticut     1682.00
8              Delaware      638.33
9  District of Columbia     2058.00
10              Florida      323.07
11              Georgia       85.74
12               Hawaii      101.60
13                Idaho       33.32
14             Illinois      227.48
15              Indiana       94.12
16                 Iowa       19.20
17               Kansas       13.84
18             Kentucky       17.32
19            Louisiana      335.34
20                Maine       45.88
21             Maryland      394.75
22        Massachusetts     1843.87
23             Michigan      316.96
24            Minnesota       19.10
25          Mississippi       37.68
26             Missouri       40.98
27              Montana        7.18
28             Nebraska        9.48
29               Nevada      184.35
30        New Hampshire      103.50
31           New Jersey     3196.29
32           New Mexico       40.82
33             New York     3274.52
34       North Carolina       51.20
35         North Dakota        6.45
36                 Ohio       82.81
37             Oklahoma       28.51
 [ reached 'max' / getOption("max.print") -- omitted 14 rows ]
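A per-state average like the one tabulated above can be computed with a grouped mean. The sketch below uses base R `aggregate` on a toy data frame; the New York rows reuse two values shown elsewhere in this report, while the remaining rows and the name `counties` are invented for illustration.

```r
# Toy province-level rows; the non-New-York values are invented.
counties <- data.frame(
  State       = c("New York", "New York", "New Jersey", "New Jersey", "Montana"),
  total_cases = c(110465, 25250, 9000, 3000, 7),
  deaths      = c(7905, 1001, 300, 100, 0)
)

# Average cases and deaths per state; cbind() on the left-hand side
# aggregates both columns in one call.
state_avg <- aggregate(cbind(total_cases, deaths) ~ State,
                       data = counties, FUN = mean)

# Round as in the printed tables.
state_avg$total_cases <- round(state_avg$total_cases, 2)
state_avg$deaths      <- round(state_avg$deaths, 3)
print(state_avg)
```

Note that this is an unweighted mean over provinces, so a state with one very large province and many small ones can still show a modest average.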

2.1.5 What is the average number of deaths for each State?

                  State  deaths
1               Alabama   1.701
2                Alaska   0.172
3               Arizona   7.133
4              Arkansas   0.427
5            California  13.328
6              Colorado   5.109
7           Connecticut  83.375
8              Delaware  14.333
9  District of Columbia  67.000
10              Florida   7.836
11              Georgia   3.270
12               Hawaii   1.800
13                Idaho   0.750
14             Illinois   8.510
15              Indiana   4.207
16                 Iowa   0.444
17               Kansas   0.657
18             Kentucky   0.900
19            Louisiana  15.922
20                Maine   1.250
21             Maryland  12.667
22        Massachusetts  49.600
23             Michigan  21.133
24            Minnesota   0.920
25          Mississippi   1.366
26             Missouri   1.284
27              Montana   0.143
28             Nebraska   0.161
29               Nevada   7.059
30        New Hampshire   0.300
31           New Jersey 133.476
32           New Mexico   0.939
33             New York 174.871
34       North Carolina   1.130
35         North Dakota   0.151
36                 Ohio   3.705
37             Oklahoma   1.403
 [ reached 'max' / getOption("max.print") -- omitted 14 rows ]

2.1.6 Which cities had the greatest % of population of people with poor health?

2.2 Patient Demographics

2.2.1 What are the patient demographics?

Table: Statistics summary.
            TC  Population  young   old  black  AIAN  Asian    NH  Hispanic   NHW  Female  Poverty  Social
Min          0          88    0.0   4.8    0.0   0.0    0.0   0.0       0.6   2.7    26.8      3.4     0.0
Q1           2       11034   20.1  16.3    0.7   0.4    0.5   0.0       2.4  64.7    49.4     11.4     8.2
Median       9       25758   22.1  19.0    2.2   0.6    0.7   0.1       4.4  83.5    50.3     14.8    11.1
Mean       191      105871   22.1  19.3    8.8   2.4    1.5   0.1       9.6  76.2    49.9     15.9    11.6
Q3          39       67013   23.8  21.8    9.6   1.3    1.4   0.1       9.9  92.3    51.0     19.0    14.4
Max     110465    10105518   42.0  57.6   85.4  92.5   43.4  48.9      96.4  97.9    56.9     48.6    52.3

From the means in the summary table, on average 22.1% of the population is under 18 and 19.3% is over 65. Non-Hispanic White is the largest racial group, with an average proportion of 76.2%. The average proportion of women is 49.9%, the average proportion of people in poverty is 15.9%, and the average Social Association Rate is 11.6. We divide the data into four levels according to total cases.

2.2.2 Which race is the majority of the sample?

Using the average proportions, we drew a pie chart of race proportions, which shows the overall share of each race in the sample.

2.3 Stay at home policy in each province

2.4 Underlying Health Conditions

2.4.1 Does any disease relate to the death rate?

The correlation plot shows that liver_total_death has the strongest correlation with deaths among the health variables, at r = 0.4338, which is a moderate association.
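Correlations of this kind can be computed with base R `cor`. The following is a minimal sketch on simulated data; only the name `liver_total_death` comes from this report, while `smoking`, `obesity`, and all values are hypothetical stand-ins for the health variables.

```r
set.seed(1)
n <- 500

# Simulated health variables; the values are not the project data.
health <- data.frame(
  liver_total_death = rnorm(n),
  smoking           = rnorm(n),
  obesity           = rnorm(n)
)
# Make deaths depend on one variable so that it stands out.
health$deaths <- 0.5 * health$liver_total_death + rnorm(n)

# Pearson correlation of each health variable with deaths,
# sorted from strongest positive to weakest.
vars <- c("liver_total_death", "smoking", "obesity")
r <- cor(health[, vars], health$deaths)[, 1]
r <- sort(r, decreasing = TRUE)
print(round(r, 3))
```

Sorting the coefficients makes it easy to see which variable, if any, stands apart from the rest.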

2.5 Impact of Temperature

2.5.1 Does the temperature relate to the Total Cases or Death Rate?

tibble [3,144 x 8] (S3: tbl_df/tbl/data.frame)
 $ Province    : chr [1:3144] "New York City" "Nassau" "Suffolk" "Westchester" ...
 $ State       : chr [1:3144] "New York" "New York" "New York" "New York" ...
 $ days        : num [1:3144] 44 41 38 43 82 35 41 80 39 38 ...
 $ total_cases : num [1:3144] 110465 25250 22691 20191 16323 ...
 $ deaths      : num [1:3144] 7905 1001 608 596 577 ...
 $ temp_peak   : num [1:3144] 8.41 7.41 6.86 5.88 2.25 ...
 $ temp_before : num [1:3144] 8.33 7.78 7.1 6.68 2.55 ...
 $ temp_current: num [1:3144] 9.23 8.36 7.82 7.53 3.34 ...

According to the correlation diagram, temperature is only weakly related to total_cases and deaths.

Quartiles of total_cases (minimum, Q1, median, Q3, maximum):
[1]      0      2      9     39 110465

3 How did you select and determine the correct model to answer your question?

3.1 Linear model

Call:
lm(formula = deaths ~ Population.Density + GDP + SHP + sleep_hour + 
    poorhealth, data = lineardf3)

Residuals:
   Min     1Q Median     3Q    Max 
-195.5   -6.3   -1.5    3.0  914.0 

Coefficients:
                         Estimate     Std. Error t value             Pr(>|t|)
(Intercept)        -41.1848758118   6.8018600120   -6.05        0.00000000161 ***
Population.Density   0.0022486139   0.0006386507    3.52              0.00044 ***
GDP                  0.0000006479   0.0000000319   20.29 < 0.0000000000000002 ***
SHP                  1.0518973957   0.2137480942    4.92        0.00000091424 ***
sleep_hour           1.5090360066   0.2395795052    6.30        0.00000000035 ***
poorhealth          -1.2298052365   0.2083461901   -5.90        0.00000000404 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 36.8 on 2584 degrees of freedom
Multiple R-squared:  0.238, Adjusted R-squared:  0.236 
F-statistic:  161 on 5 and 2584 DF,  p-value: <0.0000000000000002

Variance inflation factors (VIF):
Population.Density                GDP                SHP         sleep_hour         poorhealth
              1.22               1.24               1.29               1.67               1.81

We used the regsubsets function with the exhaustive method to find the best model under two criteria: BIC and adjusted R-squared. Both criteria select the same model, which contains five variables: Population Density per Square Mile of Land, GDP (2018), % Severe Housing Problems, Sleep <7 Hours_Percent, and % Fair or Poor Health. All five variables are significant according to their p-values. The VIFs are all below 2, so the variables show no serious multicollinearity and can stay in the model. The adjusted R-squared is 0.236, meaning the model explains 23.6% of the variation in deaths.
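The multicollinearity check can be reproduced by hand in base R. The sketch below computes each VIF as 1 / (1 - R²), where R² comes from regressing one predictor on the others. It uses simulated data with hypothetical names (`x1`, `x2`, `x3`, `y`) rather than the project variables, and it is not necessarily how the VIFs in this report were produced.

```r
set.seed(42)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)   # mildly correlated with x1 by construction
x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

fit <- lm(y ~ x1 + x2 + x3, data = dat)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j
# on the remaining predictors.
predictors <- c("x1", "x2", "x3")
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = dat))$r.squared
  1 / (1 - r2)
})

print(round(vif, 2))
print(round(summary(fit)$adj.r.squared, 3))
```

A VIF of 1 means a predictor is uncorrelated with the others; common rules of thumb treat values above 5 or 10 as a sign of problematic multicollinearity.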

Final model: deaths = -41.185 + 0.002 Population.Density + 0.0000006479 GDP + 1.052 SHP + 1.509 sleep_hour - 1.230 poorhealth

3.2 LASSO Regression

Because there are many candidate variables, we chose LASSO regression to fit the model. LASSO can shrink the coefficients of some variables exactly to 0, so it also performs variable selection.
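For reference, the LASSO estimate is usually defined as the minimizer of a penalized least-squares objective. The form below uses the 1/(2n) scaling, which is one common convention and an assumption here, since this report does not state which scaling or software was used. Here \(y_i\) is the response (total cases), \(x_i\) the vector of \(p\) predictors for observation \(i\), and \(\lambda \ge 0\) the penalty weight chosen by cross-validation.

```latex
\[
(\hat{\beta}_0, \hat{\beta})
  = \arg\min_{\beta_0,\,\beta}\;
    \frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - \beta_0 - x_i^{\top} \beta \bigr)^2
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\]
```

The absolute-value penalty is what allows some \(\hat{\beta}_j\) to be exactly zero; a larger \(\lambda\) sets more coefficients to zero.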

lowest lamda from CV:  0.00246 

The lowest cross-validated MSE occurs at \(\lambda \approx 0.002\).

Mean MSE for best Lasso lamda:  0.203 

All the coefficients : 
       (Intercept)         population              young                old 
          -0.00301            0.26499            0.03709            0.02258 
             black               AIAN              Asian                 NH 
           0.00000           -0.00339           -0.09691           -0.00689 
          Hispanic                NHW             Female              Rural 
          -0.00461            0.00751           -0.01883            0.02263 
Population.Density 
           0.11744 

The non-zero coefficients : 
       (Intercept)         population              young                old 
          -0.00301            0.26499            0.03709            0.02258 
              AIAN              Asian                 NH           Hispanic 
          -0.00339           -0.09691           -0.00689           -0.00461 
               NHW             Female              Rural Population.Density 
           0.00751           -0.01883            0.02263            0.11744 

In the LASSO fit, 11 of the 12 predictors have non-zero coefficients; only the coefficient of black is shrunk to zero. The results suggest that race, gender, age, population, population density, and rural proportion are all associated with total cases.

We then calculate the R squared of lasso regression, which is 0.164.
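An R-squared for any set of predictions can be computed from the residual and total sums of squares. The helper below is a minimal sketch of that usual definition; the function name `r_squared` and the toy vectors are hypothetical, and this report does not state whether its value was computed on training or held-out data.

```r
# R^2 = 1 - SSE / SST for observed values y and predictions y_hat.
r_squared <- function(y, y_hat) {
  sse <- sum((y - y_hat)^2)      # residual sum of squares
  sst <- sum((y - mean(y))^2)    # total sum of squares
  1 - sse / sst
}

# Toy example with invented numbers.
y     <- c(1, 2, 3, 4, 5)
y_hat <- c(1.1, 1.9, 3.2, 3.8, 5.0)
print(r_squared(y, y_hat))
```

Predicting the mean of `y` for every observation gives an R-squared of 0 under this definition, which is the baseline against which 0.164 should be read.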